TREE-STRUCTURED STICK BREAKING PROCESSES 
FOR HIERARCHICAL DATA 



By Ryan P. Adams, Zoubin Ghahramani and Michael I. Jordan 

Many data are naturally modeled by an unobserved hierarchical 
structure. In this paper we propose a flexible nonparametric prior over 
unknown data hierarchies. The approach uses nested stick-breaking 
processes to allow for trees of unbounded width and depth, where data 
can live at any node and are infinitely exchangeable. One can view 
our model as providing infinite mixtures where the components have a 
dependency structure corresponding to an evolutionary diffusion down 
a tree. By using a stick-breaking approach, we can apply Markov chain 
Monte Carlo methods based on slice sampling to perform Bayesian 
inference and simulate from the posterior distribution on trees. We 
apply our method to hierarchical clustering of images and topic 
modeling of text data. 

1. Introduction. Structural aspects of models are often critical to
obtaining flexible, expressive model families. In many cases, however, the
structure is unobserved and must be inferred, either as an end in itself or
to assist in other estimation and prediction tasks. This paper addresses an
important instance of the structure learning problem: the case when the
data arise from a latent hierarchy. We take a direct nonparametric Bayesian
approach, constructing a prior on tree-structured partitions of data that
provides for unbounded width and depth while still allowing tractable
posterior inference.

Probabilistic approaches to latent hierarchies have been explored in a 
variety of domains. Unsupervised learning of densities and nested mixtures 
has received particular attention via finite-depth trees [Williams, 2000],
diffusive branching processes [Neal, 2003a] and hierarchical clustering [Heller 
and Ghahramani, 2005, Teh et al., 2007]. Bayesian approaches to learning 
latent hierarchies have also been useful for semi-supervised learning [Kemp 
et al., 2004], relational learning [Roy et al., 2007] and multi-task learning 
[Daume III, 2009]. In the vision community, distributions over trees have 
been useful as priors for figure motion [Meeds et al., 2008] and for discovering 
visual taxonomies [Bart et al., 2008]. 

In this paper we develop a distribution over probability measures that 
imbues them with a natural hierarchy. These hierarchies have unbounded 
width and depth and the data may live at internal nodes on the tree. As the 
process is defined in terms of a distribution over probability measures and 
not as a distribution over data per se, data from this model are infinitely
exchangeable; the probability of any set of data is not dependent on its 
ordering. Unlike other infinitely exchangeable models [Neal, 2003a, Teh et al., 
2007], a pseudo-time process is not required to describe the distribution 
on trees and it can be understood in terms of other popular Bayesian 
nonparametric models. 

Our new approach allows the components of an infinite mixture model 
to be interpreted as part of a diffusive evolutionary process. Such a process 
captures the natural structure of many data. For example, some scientific 
papers are considered seminal — they spawn new areas of research and cause 
new papers to be written. We might expect that within a text corpus of 
scientific documents, such papers would be the natural ancestors of more 
specialized papers that followed on from the new ideas. This motivates two 
desirable features of a distribution over hierarchies: 1) ancestor data (the 
"prototypes") should be able to live at internal nodes in the tree, and 2) as the 
ancestor/descendant relationships are not known a priori, the data should
be infinitely exchangeable. 

2. A Tree-Structured Stick-Breaking Process. Stick-breaking processes
based on the beta distribution have played a prominent role in the
development of Bayesian nonparametric methods, most significantly with
the constructive approach to the Dirichlet process (DP) due to Sethuraman
[1994]. A random probability measure $G$ can be drawn from a DP with base
measure $\alpha H$ using a sequence of beta variates via:

(1)   $G = \sum_{i=1}^{\infty} \pi_i \delta_{\theta_i}$,   $\pi_i = \nu_i \prod_{i'=1}^{i-1} (1 - \nu_{i'})$,
      $\theta_i \sim H$,   $\nu_i \sim \mathrm{Be}(1, \alpha)$,   $\pi_1 = \nu_1$.

We can view this as taking a stick of unit length and breaking it at a random
location. We call the left side of the stick $\pi_1$ and then break the right side
again at a new place, calling the left side of this new break $\pi_2$. If we continue
this process of "keep the left piece and break the right piece again" as in
Fig. 1a, assigning each $\pi_i$ a random value $\theta_i$ drawn from $H$, we can view this
as a random probability measure centered on $H$. The distribution over the
sequence $(\pi_1, \pi_2, \ldots)$ is a case of the GEM distribution [Pitman, 2002], which
also includes the Pitman-Yor process [Pitman and Yor, 1997]. Note that in
Eq. (1) the $\theta_i$ are i.i.d. from $H$; in the current paper these parameters will
be drawn according to a hierarchical process.
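The "keep the left piece and break the right piece again" procedure of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `dp_stick_weights` and the truncation to a finite number of sticks are assumptions made here for the example.

```python
import random

def dp_stick_weights(alpha, num_sticks, rng=None):
    """Return the first `num_sticks` weights pi_i of Eq. (1) and the unbroken remainder.

    The truncation to `num_sticks` is an illustrative assumption; the process
    itself continues indefinitely.
    """
    rng = rng or random.Random()
    weights = []
    remaining = 1.0  # length of the right-hand piece that is still unbroken
    for _ in range(num_sticks):
        nu = rng.betavariate(1.0, alpha)   # nu_i ~ Be(1, alpha)
        weights.append(nu * remaining)     # pi_i = nu_i * prod_{i' < i} (1 - nu_{i'})
        remaining *= 1.0 - nu
    return weights, remaining

weights, leftover = dp_stick_weights(alpha=2.0, num_sticks=50, rng=random.Random(0))
```

The returned weights and the leftover piece always add up to the unit stick, which is the sense in which the construction partitions the unit interval.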

Fig 1: a) Dirichlet process stick-breaking procedure, with a linear partitioning.
b) Interleaving two stick-breaking processes yields a tree-structured partition.
Rows 1, 3 and 5 are $\nu$-breaks. Rows 2 and 4 are $\psi$-breaks.

The GEM construction provides a distribution over infinite partitions of
the unit interval, with natural numbers as the index set as in Fig. 1a. In this
paper, we extend this idea to create a distribution over infinite partitions
that also possess a hierarchical graph topology. To do this, we will use
finite-length sequences of natural numbers as our index set on the partitions.
Borrowing notation from the Polya tree (PT) construction [Mauldin et al.,
1992], let $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_K)$ denote a length-$K$ sequence of positive integers,
i.e., $\epsilon_k \in \mathbb{N}^+$. We denote the zero-length string as $\epsilon = \emptyset$ and use $|\epsilon|$ to indicate
the length of $\epsilon$'s sequence. These strings will index the nodes in the tree
and $|\epsilon|$ will then be the depth of node $\epsilon$.

We interleave two stick-breaking procedures as in Fig. 1b. The first has
beta variates $\nu_\epsilon \sim \mathrm{Be}(1, \alpha(|\epsilon|))$ which determine the size of a given node's
partition as a function of depth. The second has beta variates $\psi_\epsilon \sim \mathrm{Be}(1, \gamma)$,
which determine the branching probabilities. Interleaving these processes
partitions the unit interval. The size of the partition associated with each $\epsilon$
is given by

(2)   $\pi_\epsilon = \nu_\epsilon \varphi_\epsilon \prod_{\epsilon' \prec \epsilon} \varphi_{\epsilon'} (1 - \nu_{\epsilon'})$,   $\varphi_{\epsilon\epsilon_i} = \psi_{\epsilon\epsilon_i} \prod_{j=1}^{\epsilon_i - 1} (1 - \psi_{\epsilon j})$,   $\pi_\emptyset = \nu_\emptyset$,

where $\epsilon\epsilon_i$ denotes the sequence that results from appending $\epsilon_i$ onto the end
of $\epsilon$, and $\epsilon' \prec \epsilon$ indicates that $\epsilon$ could be constructed by appending onto $\epsilon'$.
When viewing these strings as identifying nodes on a tree, $\{\epsilon\epsilon_i : \epsilon_i \in 1, 2, \ldots\}$
are the children of $\epsilon$ and $\{\epsilon' : \epsilon' \prec \epsilon\}$ are the ancestors of $\epsilon$. The $\{\pi_\epsilon\}$ in
Eq. (2) can be seen as products of several decisions on how to allocate mass
to nodes and branches in the tree: the $\{\varphi_\epsilon\}$ determine the probability of a
particular sequence of children and the $\nu_\epsilon$ and $(1 - \nu_\epsilon)$ terms determine the
proportion of mass allotted to $\epsilon$ versus nodes that are descendants of $\epsilon$.
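To make the allocation of mass in Eq. (2) concrete, the sketch below draws the two kinds of sticks on a truncated tree and returns the mass of every represented node. The truncation (`max_depth`, `max_children`), the function name `tssb_masses` and the depth schedule $\alpha(j) = \lambda^j \alpha_0$ passed in as `alpha0` and `lam` are assumptions made for illustration; the untruncated process has unbounded width and depth.

```python
import random

def tssb_masses(alpha0, lam, gamma, max_depth, max_children, rng):
    """Draw pi_eps for every node of a truncated tree.

    Nodes are tuples of positive integers; the empty tuple () is the root.
    """
    masses = {}

    def recurse(path, mass_in):
        # mass_in = phi_eps * prod over ancestors eps' of phi_eps' * (1 - nu_eps'),
        # with the convention that the root receives the whole unit interval.
        depth = len(path)
        nu = rng.betavariate(1.0, (lam ** depth) * alpha0)  # nu-stick
        masses[path] = nu * mass_in                         # pi_eps
        if depth == max_depth:
            return
        below = (1.0 - nu) * mass_in   # mass passed on to the descendants
        remaining = 1.0
        for i in range(1, max_children + 1):
            psi = rng.betavariate(1.0, gamma)               # psi-stick
            phi = psi * remaining      # phi for child i: psi_i * prod_{j<i} (1 - psi_j)
            remaining *= 1.0 - psi
            recurse(path + (i,), below * phi)

    recurse((), 1.0)
    return masses

masses = tssb_masses(1.0, 0.5, 1.0, max_depth=3, max_children=4, rng=random.Random(0))
```

Because of the truncation the returned masses sum to less than one; the shortfall is the mass belonging to the unrepresented children and deeper levels.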

We require that the $\{\pi_\epsilon\}$ sum to one. The $\psi$-sticks have no effect upon
this, but $\alpha(\cdot) : \mathbb{N} \to \mathbb{R}^+$ (the depth-varying parameter for the $\nu$-sticks) must
satisfy $\sum_{j=1}^{\infty} \ln(1 + 1/\alpha(j-1)) = +\infty$ (see Ishwaran and James [2001]). This
is clearly true for $\alpha(j) = \alpha_0 > 0$. A useful function that also satisfies this
condition is $\alpha(j) = \lambda^j \alpha_0$ with $\alpha_0 > 0$, $\lambda \in (0, 1]$. The decay parameter $\lambda$ allows
a distribution over trees with most of the mass at an intermediate depth.
This is the $\alpha(\cdot)$ we will assume throughout the remainder of the paper.
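As a quick check that the geometric schedule meets the divergence condition, the following short derivation (a sketch using only the definitions in this section) bounds each term of the sum from below:

```latex
% Assumes \alpha(j) = \lambda^{j}\alpha_0 with \alpha_0 > 0 and \lambda \in (0, 1].
\alpha(j-1) = \lambda^{j-1}\alpha_0 \le \alpha_0
\;\Longrightarrow\;
\ln\!\left(1 + \frac{1}{\alpha(j-1)}\right) \ge \ln\!\left(1 + \frac{1}{\alpha_0}\right) > 0
\quad \text{for all } j \ge 1,
\qquad \text{so} \qquad
\sum_{j=1}^{\infty} \ln\!\left(1 + \frac{1}{\alpha(j-1)}\right)
\ge \sum_{j=1}^{\infty} \ln\!\left(1 + \frac{1}{\alpha_0}\right) = +\infty .
```

The constant case $\alpha(j) = \alpha_0$ is the special case $\lambda = 1$ of the same bound.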

An Urn-based View. When a Bayesian nonparametric model induces 
partitions over data, it is sometimes possible to construct an urn scheme 
that corresponds to sequentially generating data, while integrating out the 
underlying random measure. The "Chinese restaurant" metaphor for the 
Dirichlet process is a popular example. In our model, we can use such an urn 
scheme to construct a treed partition over a finite set of data. Note that while 
the tree illustrated in Fig. 1b is a nested set of size-biased partitions, the
ordering of the branches in an urn-based tree over data does not necessarily 
correspond to a size-biased permutation [Pitman, 1996]. 

The data drawing process can be seen as a path-reinforcing Bernoulli trip
down the tree where each datum starts at the root and descends into children
until it stops at some node. The first datum lives at the root node with probability
$1/(\alpha(0) + 1)$, otherwise it descends and instantiates a new child. It stays
at this new child with probability $1/(\alpha(1) + 1)$ or descends again and so on. A
later datum stays at node $\epsilon$ with probability $(N_\epsilon + 1)/(N_\epsilon + N_{\epsilon\prec\cdot} + \alpha(|\epsilon|) + 1)$,
where $N_\epsilon$ is the number of previous data that stopped at $\epsilon$, and $N_{\epsilon\prec\cdot}$ is
the number of previous data that came down this path of the tree but did
not stop at $\epsilon$, i.e., a sum over all descendants: $N_{\epsilon\prec\cdot} = \sum_{\epsilon \prec \epsilon'} N_{\epsilon'}$. If a datum
descends to $\epsilon$ but does not stop then it chooses which child to descend to
according to a Chinese restaurant process where the previous customers
are only those data who have also descended to this point. That is, if it
has reached node $\epsilon$ but will not stay there, it descends to existing child $\epsilon\epsilon_i$
with probability $(N_{\epsilon\epsilon_i} + N_{\epsilon\epsilon_i\prec\cdot})/(N_{\epsilon\prec\cdot} + \gamma)$ and instantiates a new child with
probability $\gamma/(N_{\epsilon\prec\cdot} + \gamma)$. A particular path therefore becomes more likely
according to its "popularity" with previous data. Note that a node can be a
part of a popular path without having any data of its own. Fig. 2 shows
the structures implied over fifty data drawn from this process with different
hyperparameter settings.
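The urn scheme just described can be simulated directly. The following is a minimal sketch under the assumption $\alpha(j) = \lambda^j \alpha_0$; the function names and the three bookkeeping dictionaries (`stopped`, `passed`, `children`) are hypothetical choices made for this example.

```python
import random

def draw_datum(stopped, passed, children, alpha0, lam, gamma, rng):
    """Send one datum down the tree, update the counts, and return its node (a tuple)."""
    node = ()
    while True:
        n_here = stopped.get(node, 0)    # data that stopped at this node
        n_below = passed.get(node, 0)    # data that came down this path and went further
        alpha = (lam ** len(node)) * alpha0
        p_stay = (n_here + 1.0) / (n_here + n_below + alpha + 1.0)
        if rng.random() < p_stay:
            stopped[node] = n_here + 1
            return node
        passed[node] = n_below + 1
        # Chinese restaurant process over the children of this node, where the
        # customers are the n_below earlier data that also descended from here.
        kids = children.setdefault(node, [])
        r = rng.random() * (n_below + gamma)
        chosen = None
        for kid in kids:
            r -= stopped.get(kid, 0) + passed.get(kid, 0)
            if r < 0:
                chosen = kid
                break
        if chosen is None:               # happens with probability gamma / (n_below + gamma)
            chosen = node + (len(kids) + 1,)
            kids.append(chosen)
        node = chosen

def simulate(num_data, alpha0, lam, gamma, seed=0):
    rng = random.Random(seed)
    stopped, passed, children = {}, {}, {}
    nodes = [draw_datum(stopped, passed, children, alpha0, lam, gamma, rng)
             for _ in range(num_data)]
    return nodes, stopped, passed

nodes, stopped, passed = simulate(50, alpha0=5.0, lam=0.5, gamma=1.0)
```

Nodes that appear in `passed` but not in `stopped` are exactly the represented nodes that lie on a popular path without holding any data of their own.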

The urn view allows us to place this model into the literature on priors 
on infinite trees. One of the main contributions of this model is that the 
data can live at arbitrary internal nodes in the tree, but are nevertheless 
infinitely exchangeable. This is in contrast to the model proposed by Meeds 
et al. [2008], for example, which is not infinitely exchangeable. The nested 
Chinese restaurant process (nCRP) [Blei et al., 2010] provides a distribution 
over trees of unbounded width and depth, but data correspond to reinforcing 
paths of infinite length, requiring an additional distribution over depths that 
is not path-dependent. The Polya tree [Mauldin et al., 1992] uses a recursive 
stick-breaking process to specify a distribution over nested partitions in a 
binary tree, however the resulting data live at the infinitely-deep leaf nodes. 
The marginal distribution over the topology of a Dirichlet diffusion tree 
[Neal, 2003a] (and the clustering variant of Kingman's coalescent proposed 
by Teh et al. [2007]) provides path-reinforcement and infinite exchangeability, 
however the topology is determined by a hazard process in pseudo-time and 
data do not live at internal nodes. 

3. Hierarchical Priors for Node Parameters. In the stick-breaking
construction of the Dirichlet process one can view the procedure as generating
an infinite partition and then labeling each cell $i$ with parameter $\theta_i$ drawn i.i.d.
from $H$. In a mixture model, data that are drawn from the $i$th component
are generated independently according to a distribution $f(x \mid \theta_i)$, where $x$
takes values in a sample space $\mathcal{X}$. In our model, we continue to assume
that the data are generated independently given the latent labeling, but
to take advantage of the tree-structured partitioning of Section 2 an i.i.d.
assumption on the node parameters is inappropriate. Rather, the distribution
over the parameters at node $\epsilon$, denoted $\theta_\epsilon$, should depend in an interesting
way on its ancestors $\{\theta_{\epsilon'} : \epsilon' \prec \epsilon\}$. A natural and powerful way to specify
such dependency is via a directed graphical model, with the requirement
that edges must always point down the tree. An intuitive subclass of such
graphical models are those in which a child is conditionally independent of all
ancestors, given its parent and any global hyperparameters. This is the case
we will focus on here, as it provides a useful view of the parameter-generation
process as a "diffusion down the tree" via a Markov transition kernel that
can be essentially any distribution with a location parameter. Coupling such
a kernel, which we denote $T(\theta_{\epsilon\epsilon_i} \leftarrow \theta_\epsilon)$, with a root-level prior $p(\theta_\emptyset)$ and the
node-wise data distribution $f(x \mid \theta_\epsilon)$, we have a complete model for infinitely
exchangeable tree-structured data on $\mathcal{X}$. We now examine a few specific
examples.

Fig 2: Eight samples of trees over partitions of fifty data, with different
hyperparameter settings (panels (a)-(h), with $\alpha_0 \in \{1, 5, 25\}$ and varying
$\lambda$ and $\gamma$). The circles are represented nodes, and the squares
are the data. Note that some of the sampled trees have represented nodes
with no data associated with them and that the branch ordering does not
correspond to a size-biased permutation.


Generalized Gaussian Diffusions. If our data distribution $f(x \mid \theta)$ is such
that the parameters can be specified as a real-valued vector $\theta \in \mathbb{R}^M$, then
we can use a Gaussian distribution to describe the parent-to-child transition
kernel: $T_{\mathrm{norm}}(\theta_{\epsilon\epsilon_i} \leftarrow \theta_\epsilon) = \mathcal{N}(\theta_{\epsilon\epsilon_i} \mid \eta\,\theta_\epsilon, \Lambda)$, where $\eta \in [0, 1)$. Such a kernel
captures the simple idea that the child's parameters are noisy versions of the
parent's, as specified by the covariance matrix $\Lambda$, while $\eta$ ensures that all
parameters in the tree have a finite marginal variance. While this will not
result in a conjugate model unless the data are themselves Gaussian, it has
the simple property that each node's parameter has a Gaussian prior that is
specified by its parent. We present an application of this model in Section 5,
where we model images as a distribution over binary vectors obtained by
transforming a real-valued vector to (0, 1) via the logistic function.
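The role of $\eta$ in keeping the marginal variance finite can be seen from a short recursion. The derivation below is a sketch that assumes the root parameter has some finite prior covariance, written $\Sigma_0$ here; that symbol, and $\Sigma_d$ for the marginal covariance at depth $d$, are notation introduced only for this illustration.

```latex
% Child = eta * parent + independent Gaussian noise with covariance \Lambda.
\Sigma_d = \eta^2 \Sigma_{d-1} + \Lambda
         = \eta^{2d} \Sigma_0 + \Big( \sum_{k=0}^{d-1} \eta^{2k} \Big) \Lambda
         = \eta^{2d} \Sigma_0 + \frac{1 - \eta^{2d}}{1 - \eta^{2}} \, \Lambda
\;\xrightarrow[d \to \infty]{}\; \frac{1}{1 - \eta^{2}} \, \Lambda ,
\qquad 0 \le \eta < 1 .
% For comparison, \eta = 1 would give \Sigma_d = \Sigma_0 + d \Lambda, which grows without bound.
```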

Chained Dirichlet-Multinomial Distributions. If each datum is a set of
counts over $M$ discrete outcomes, as in many finite topic models, a multinomial
model for $f(x \mid \theta)$ may be appropriate. In this case, $\mathcal{X} = \mathbb{N}^M$ and $\theta_\epsilon$
takes values in the $(M-1)$-simplex. We can construct a parent-to-child
transition kernel via a Dirichlet distribution with concentration parameter
$\kappa$: $T_{\mathrm{dir}}(\theta_{\epsilon\epsilon_i} \leftarrow \theta_\epsilon) = \mathrm{Dir}(\kappa\,\theta_\epsilon)$, using a symmetric Dirichlet for the root
node, i.e., $\theta_\emptyset \sim \mathrm{Dir}(\kappa \mathbf{1})$.
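A chained Dirichlet kernel of this kind can be sampled by normalizing independent gamma variates. The sketch below is illustrative only: the function names are hypothetical, and the small floor on the concentration is a numerical guard added here, not part of the model.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from Dir(concentration) by normalizing independent gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [g / total for g in draws]

def sample_root(num_outcomes, kappa, rng):
    """Root parameter from a symmetric Dirichlet, Dir(kappa * 1)."""
    return sample_dirichlet([kappa] * num_outcomes, rng)

def sample_child(theta_parent, kappa, rng):
    """Child parameter from Dir(kappa * theta_parent).

    The floor of 1e-10 keeps gammavariate's shape argument strictly positive
    when a parent component underflows; it is an assumption of this sketch.
    """
    return sample_dirichlet([max(kappa * t, 1e-10) for t in theta_parent], rng)

rng = random.Random(0)
root = sample_root(4, 5.0, rng)
child = sample_child(root, 50.0, rng)
```

Larger values of the concentration keep the child closer to its parent, since the mean of the child's Dirichlet is the parent's parameter.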

Hierarchical Dirichlet Processes. A very general way to specify the distribution
over data is to say that it is a random probability measure drawn from
a Dirichlet process. In our case, a very flexible model would say that the
data drawn at node $\epsilon$ are from a distribution $G_\epsilon$ as in Eq. (1). This means
that $\theta_\epsilon \sim G_\epsilon$ where $\theta_\epsilon$ now corresponds to an infinite set of parameters. The
hierarchical Dirichlet process (HDP) [Teh et al., 2006] provides a natural
parent-to-child transition kernel for the tree-structured model, again with
concentration parameter $\kappa$: $T_{\mathrm{hdp}}(G_{\epsilon\epsilon_i} \leftarrow G_\epsilon) = \mathrm{DP}(\kappa G_\epsilon)$. At the top level, we
specify a global base measure $H$ for the root node, i.e., $G_\emptyset \sim H$. One negative
aspect of this transition kernel is that the $G_\epsilon$ will have a tendency to collapse
down onto a single atom. One remedy is to smooth the kernel with $\eta$ as in
the Gaussian case, i.e., $T_{\mathrm{hdp}}(G_{\epsilon\epsilon_i} \leftarrow G_\epsilon) = \mathrm{DP}(\kappa\,(\eta\,G_\epsilon + (1 - \eta)\,H))$.

4. Inference via Markov chain Monte Carlo. We have so far defined
a model for data that are generated from the parameters associated
with the nodes of a random tree. Having seen $N$ data points and assuming
a model $f(x \mid \theta_\epsilon)$ as in the previous section, we wish to infer possible trees
and model parameters. As in most complex probabilistic models, closed form
inference is impossible and we instead perform inference by generating posterior
samples via Markov chain Monte Carlo (MCMC). To operate efficiently
over a variety of regimes without tuning, we use slice sampling [Neal, 2003b]
extensively. This allows us to sample from the true posterior distribution
over the finite quantities of interest despite the fact that our model
technically contains an infinite number of parameters. The primary data
structure in our Markov chain is the set of $N$ strings describing the current
assignments of data to nodes, which we denote $\{\epsilon_n\}_{n=1}^{N}$. We represent the
$\nu$-sticks and parameters $\theta_\epsilon$ for all nodes that are traversed by the data in its
current assignments, i.e., $\{\nu_\epsilon, \theta_\epsilon : \exists n, \epsilon \preceq \epsilon_n\}$. We additionally represent all
$\psi$-sticks in the "hull" of the tree that contains the data: if at some node $\epsilon$
one of the $N$ data paths passes through child $\epsilon\epsilon_i$, then we represent all
the $\psi$-sticks in the set $\bigcup_{\epsilon_n} \bigcup_{\epsilon\epsilon_i \preceq \epsilon_n} \{\psi_{\epsilon\epsilon_j} : \epsilon_j \le \epsilon_i\}$. We also sample from the
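hyperparameters $\alpha_0$, $\gamma$ and $\lambda$ for the tree and any parameters associated
with the likelihoods.

function SAMP-ASSIGNMENT($n$)
    $p_{\mathrm{slice}} \sim \mathrm{Uni}(0, f(x_n \mid \theta_{\epsilon_n}))$
    $u_{\min} \leftarrow 0$, $u_{\max} \leftarrow 1$
    loop
        $u \sim \mathrm{Uni}(u_{\min}, u_{\max})$
        $\epsilon \leftarrow$ FIND-NODE($u$, $\emptyset$)
        $p \leftarrow f(x_n \mid \theta_\epsilon)$
        if $p > p_{\mathrm{slice}}$ then return $\epsilon$
        else if $\epsilon < \epsilon_n$ then $u_{\min} \leftarrow u$
        else $u_{\max} \leftarrow u$

function FIND-NODE($u$, $\epsilon$)
    if $u < \nu_\epsilon$ then return $\epsilon$
    else
        $u \leftarrow (u - \nu_\epsilon)/(1 - \nu_\epsilon)$
        while $u \ge 1 - \prod_j (1 - \psi_{\epsilon\epsilon_j})$ do
            Draw a new $\psi$-stick
        $e \leftarrow$ edges from $\psi$-sticks
        $i \leftarrow$ bin index for $u$ from edges
        Draw $\theta_{\epsilon\epsilon_i}$ and $\nu_{\epsilon\epsilon_i}$ if necessary
        $u \leftarrow (u - e_i)/(e_{i+1} - e_i)$
        return FIND-NODE($u$, $\epsilon\epsilon_i$)

function SIZE-BIASED-PERM($\epsilon$)
    $\rho \leftarrow \emptyset$
    while represented children remain do
        $w \leftarrow$ weights from $\{\psi_{\epsilon\epsilon_i}\}$
        $w \leftarrow w \setminus \rho$
        $j \sim w$
        $\rho \leftarrow$ append $j$
    return $\rho$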

Slice Sampling Data Assignments. The primary challenge in inference with
Bayesian nonparametric mixture models is often sampling from the posterior
distribution over assignments, as it is frequently difficult to integrate over the
infinity of unrepresented components. To avoid this difficulty, we use a slice
sampling approach that can be viewed as a combination of the Dirichlet slice
sampler of Walker [2007] and the retrospective sampler of Papaspiliopoulos
and Roberts [2008].

Section 2 described a path-reinforcing process for generating data from
the model. An alternative method is to draw a uniform variate $u$ on (0, 1)
and break sticks until we know which $\pi_\epsilon$ the $u$ fell into. One can imagine
throwing a dart at the top of Fig. 1b and considering which $\pi_\epsilon$ it hits. We
would draw the sticks and parameters from the prior, as needed, conditioning
on the state instantiated from any previous draws and with parent-to-child
transitions enforcing the prior downwards in the tree. Calling the pseudocode
function FIND-NODE($u$, $\epsilon$) with $u \sim \mathrm{Uni}(0, 1)$ and $\epsilon = \emptyset$ draws such a sample.
This provides a retrospective slice sampling scheme on $u$, allowing us to draw
posterior samples without having to specify any tuning parameters.

To slice sample the assignment of the $n$th datum, currently assigned to $\epsilon_n$,
we initialize our slice sampling bounds to (0, 1). We draw a new $u$ from the
bounds and use the FIND-NODE function to determine the associated $\epsilon$ from
the currently-represented state, plus any additional state that must be drawn
from the prior. We do a lexical comparison ("string-like") of the new $\epsilon$ and
our current state $\epsilon_n$, to determine whether this new path corresponds to
a $u$ that is "above" or "below" our current state. This lexical comparison
prevents us from having to represent the initial $u_n$. We shrink the slice
sampling bounds appropriately, depending on the result of the comparison,
until we find a $u$ whose assignment satisfies the slice. This procedure is given
in pseudocode as SAMP-ASSIGNMENT($n$). After performing this procedure,
we can discard any state that is not in the previously-mentioned hull of
representation.
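The retrospective dart throw and the shrinking slice can be sketched as runnable code. Everything specific below is an assumption made for illustration: the scalar Gaussian diffusion kernel (`eta`, `sigma`), the unit-variance Gaussian likelihood `log_lik`, the class names, and the clamp `ALMOST_ONE` that guards against floating-point round-up. Node paths are stored as tuples, so Python's built-in tuple ordering plays the role of the lexical comparison.

```python
import math
import random

ALMOST_ONE = 1.0 - 1e-12  # numerical guard: keeps a rescaled u strictly below 1

class Node:
    def __init__(self, path, theta, nu):
        self.path = path        # tuple of child indices; () is the root
        self.theta = theta      # node parameter
        self.nu = nu            # nu-stick: share of incoming mass kept at this node
        self.psi = []           # psi-sticks of the represented children
        self.children = []      # child nodes, parallel to self.psi

class Tree:
    def __init__(self, alpha0, lam, gamma, eta, sigma, rng):
        self.alpha0, self.lam, self.gamma = alpha0, lam, gamma
        self.eta, self.sigma, self.rng = eta, sigma, rng
        self.root = self._new_node((), rng.gauss(0.0, sigma))

    def _new_node(self, path, theta):
        nu = self.rng.betavariate(1.0, (self.lam ** len(path)) * self.alpha0)
        return Node(path, theta, nu)

    def add_child(self, node):
        """Retrospectively draw one more psi-stick and child from the prior."""
        node.psi.append(self.rng.betavariate(1.0, self.gamma))
        theta = self.rng.gauss(self.eta * node.theta, self.sigma)
        path = node.path + (len(node.children) + 1,)
        node.children.append(self._new_node(path, theta))

def find_node(tree, u):
    """Return the node whose interval contains u, instantiating state as needed."""
    node = tree.root
    while True:
        u = min(u, ALMOST_ONE)
        if u < node.nu:
            return node
        u = min((u - node.nu) / (1.0 - node.nu), ALMOST_ONE)
        edges, remaining = [0.0], 1.0
        for psi in node.psi:
            remaining *= 1.0 - psi
            edges.append(1.0 - remaining)
        while u >= edges[-1]:            # u lies beyond the represented children
            tree.add_child(node)
            remaining *= 1.0 - node.psi[-1]
            edges.append(1.0 - remaining)
        i = next(k for k in range(len(node.children)) if u < edges[k + 1])
        u = (u - edges[i]) / (edges[i + 1] - edges[i])
        node = node.children[i]

def log_lik(x, theta):
    return -0.5 * (x - theta) ** 2       # unit-variance Gaussian, up to a constant

def samp_assignment(tree, x, current):
    """Slice sample a new node for datum x, currently assigned to `current`."""
    log_slice = log_lik(x, current.theta) + math.log(1.0 - tree.rng.random())
    u_min, u_max = 0.0, 1.0
    while True:
        u = tree.rng.uniform(u_min, u_max)
        node = find_node(tree, u)
        if log_lik(x, node.theta) >= log_slice:
            return node
        if node.path < current.path:     # lexical comparison of paths
            u_min = u
        else:
            u_max = u
```

Working in log space avoids underflow of the likelihood, and the current node always satisfies the slice, so the shrinking bracket is guaranteed to contain an acceptable interval. Discarding state outside the hull after each move is not shown.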

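Gibbs Sampling Stick Lengths. Given the represented sticks and the current
assignments of nodes to data, it is straightforward to resample the lengths of
the sticks from the posterior beta distributions

$\nu_\epsilon \mid \text{data} \sim \mathrm{Be}(N_\epsilon + 1,\; N_{\epsilon\prec\cdot} + \alpha(|\epsilon|))$

$\psi_{\epsilon\epsilon_i} \mid \text{data} \sim \mathrm{Be}(N_{\epsilon\epsilon_i\prec\cdot} + 1,\; \gamma + \sum_{j>i} N_{\epsilon\epsilon_j\prec\cdot})$,

where $N_\epsilon$ and $N_{\epsilon\prec\cdot}$ are the path-based counts as described in Section 2.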
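These conjugate updates amount to drawing beta variates with count-adjusted parameters. The sketch below is illustrative and makes one interpretive assumption that should be checked against the intended definition of the counts: for the $\psi$-sticks, `child_counts[i]` is taken to be the number of data whose paths enter child $i$ (whether they stop there or continue below), matching the numerator used in the urn scheme of Section 2. The function names are hypothetical.

```python
import random

def resample_nu(n_stop, n_below, depth, alpha0, lam, rng):
    """nu-stick posterior: Be(N_stop + 1, N_below + alpha(depth)), alpha(j) = lam**j * alpha0."""
    return rng.betavariate(n_stop + 1.0, n_below + (lam ** depth) * alpha0)

def resample_psi(child_counts, gamma, rng):
    """psi-stick posteriors for the ordered children of one node.

    child_counts[i] is assumed to count the data whose paths enter child i.
    """
    psis = []
    later = sum(child_counts)
    for count in child_counts:
        later -= count   # data that entered a child ordered after this one
        psis.append(rng.betavariate(count + 1.0, gamma + later))
    return psis

rng = random.Random(0)
nu = resample_nu(n_stop=2, n_below=5, depth=1, alpha0=5.0, lam=0.5, rng=rng)
psis = resample_psi([3, 0, 2], gamma=1.0, rng=rng)
```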

Gibbs Sampling the Ordering of the $\psi$-Sticks. When using the stick-breaking
representation of the Dirichlet process, it is crucial for mixing to sample
over possible orderings of the sticks. In our model, we include such moves on
the $\psi$-sticks. We iterate over each instantiated node $\epsilon$ and perform a Gibbs
update of the ordering of its immediate children using its invariance under
size-biased permutation (SBP) [Pitman, 1996]. For a given node, the $\psi$-sticks
provide a "local" set of weights that sum to one. We repeatedly draw without
replacement from the discrete distribution implied by the weights and keep
the ordering that results. Pitman [1996] showed that distributions
over sequences such as our $\psi$-sticks are invariant under such permutations and
we can view the SIZE-BIASED-PERM($\epsilon$) procedure as a Metropolis-Hastings
proposal with an acceptance ratio that is always one.
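Drawing a size-biased permutation is a matter of sampling indices without replacement in proportion to their weights. The sketch below illustrates this for the represented children of a single node; the helper names are hypothetical, and only the ordering is returned (re-deriving the sticks for the new ordering is not shown).

```python
import random

def local_weights(psis):
    """Convert ordered psi-sticks into the local branch weights psi_i * prod_{j<i} (1 - psi_j)."""
    weights, remaining = [], 1.0
    for psi in psis:
        weights.append(psi * remaining)
        remaining *= 1.0 - psi
    return weights

def size_biased_perm(weights, rng):
    """Return an ordering of indices drawn without replacement, proportional to weight."""
    remaining = list(range(len(weights)))
    order = []
    while remaining:
        total = sum(weights[k] for k in remaining)
        r = rng.random() * total
        pick = remaining[-1]             # fallback in case of floating-point round-off
        for k in remaining:
            r -= weights[k]
            if r < 0:
                pick = k
                break
        order.append(pick)
        remaining.remove(pick)
    return order

order = size_biased_perm(local_weights([0.5, 0.4, 0.3]), random.Random(0))
```

Heavier branches tend to appear earlier in the returned ordering, but any permutation has positive probability when all weights are positive.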

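Slice Sampling Stick-Breaking Hyperparameters. Given all of the instantiated
sticks, we slice sample from the conditional posterior distribution over
the hyperparameters $\alpha_0$, $\lambda$ and $\gamma$:

$p(\alpha_0, \lambda \mid \{\nu_\epsilon\}) \propto \mathbb{I}(\alpha_0^{\min} < \alpha_0 < \alpha_0^{\max})\, \mathbb{I}(\lambda^{\min} < \lambda < \lambda^{\max}) \prod_{\epsilon} \mathrm{Be}(\nu_\epsilon \mid 1, \lambda^{|\epsilon|} \alpha_0)$

$p(\gamma \mid \{\psi_\epsilon\}) \propto \mathbb{I}(\gamma^{\min} < \gamma < \gamma^{\max}) \prod_{\epsilon} \mathrm{Be}(\psi_\epsilon \mid 1, \gamma)$,

where the products are over nodes in the aforementioned hull. We initialize
the bounds of the slice sampler with the bounds of the top-hat prior.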
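For the branching hyperparameter, the conditional posterior is a product of $\mathrm{Be}(\psi \mid 1, \gamma) = \gamma (1 - \psi)^{\gamma - 1}$ terms restricted to the support of the top-hat prior, which makes a univariate slice sampler with shrinkage straightforward. The following is a minimal sketch; the function names and the example bounds are assumptions made for illustration.

```python
import math
import random

def log_post_gamma(gamma, psis):
    """Log of prod Be(psi | 1, gamma), up to the constant top-hat prior density."""
    return sum(math.log(gamma) + (gamma - 1.0) * math.log1p(-psi) for psi in psis)

def slice_sample_gamma(gamma, psis, lo, hi, rng):
    """One slice-sampling update of gamma with bounds initialized to the prior support."""
    log_slice = log_post_gamma(gamma, psis) + math.log(1.0 - rng.random())
    left, right = lo, hi
    while True:
        proposal = rng.uniform(left, right)
        if log_post_gamma(proposal, psis) >= log_slice:
            return proposal
        # Shrink the bracket toward the current value, which always lies in the slice.
        if proposal < gamma:
            left = proposal
        else:
            right = proposal

rng = random.Random(0)
gamma = 2.0
for _ in range(100):
    gamma = slice_sample_gamma(gamma, [0.2, 0.5, 0.1, 0.7], lo=1.0, hi=10.0, rng=rng)
```

Because the bracket starts at the prior bounds, no step-size parameter needs to be chosen. The same pattern could be applied to $\alpha_0$ and $\lambda$ with the $\nu$-stick terms in place of the $\psi$-stick terms.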

Selecting a Single Tree. We have so far described a procedure for generating
posterior samples from the tree structures and associated stick-breaking
processes. If our objective is to find a single tree, however, samples
from the posterior distribution are unsatisfying. Following Blei et al.
[2010], we report a best single tree structure over the data by choosing the
sample from our Markov chain that has the highest complete-data likelihood
$p(\{x_n, \epsilon_n\}_{n=1}^{N} \mid \{\nu_\epsilon\}, \{\psi_\epsilon\}, \alpha_0, \lambda, \gamma)$.

5. Hierarchical Clustering of Images. We applied our model and
MCMC inference to the problem of hierarchically clustering the CIFAR-100
image data set^1. These data are a labeled subset of the 80 million tiny images
dataset [Torralba et al., 2008] with 50,000 32 x 32 color images. We did not
use the labels in our clustering. We modeled the images via 256-dimensional
binary features that had been extracted from each image (i.e., $x_n \in \{0, 1\}^{256}$)
using a deep neural network that had been trained for an image retrieval task
[Krizhevsky, 2009]. We used a factored Bernoulli likelihood at each node,
parameterized by a latent 256-dimensional real vector (i.e., $\theta_\epsilon \in \mathbb{R}^{256}$) that
was transformed component-wise via the logistic function:

$f(x_n \mid \theta_\epsilon) = \prod_{d=1}^{256} \left(1 + \exp\{-\theta_\epsilon^{(d)}\}\right)^{-x_n^{(d)}} \left(1 + \exp\{\theta_\epsilon^{(d)}\}\right)^{-(1 - x_n^{(d)})}$.

The prior over the parameters of a child node was Gaussian with its parent's
value as the mean. The covariance of the prior ($\Lambda$ in Section 3) was
diagonal and inferred as part of the Markov chain. We placed independent
Uni(0.01, 1) priors on the elements of the diagonal. To efficiently learn
the node parameters, we used Hamiltonian (hybrid) Monte Carlo (HMC)
[Duane et al., 1987, Neal, 1993], taking 25 leapfrog HMC steps, with a
randomized step size. We occasionally interleaved a slice sampling move
for robustness. For the stick-breaking processes, we used $\alpha_0 \sim \mathrm{Uni}(10, 50)$,
$\lambda \sim \mathrm{Uni}(0.05, 0.8)$, and $\gamma \sim \mathrm{Uni}(1, 10)$. Using Python on a single core of a
modern workstation, each MCMC sweep of the entire model (including slice
sampled reassignment of all 50,000 images) requires approximately three
minutes. Fig. 3 represents a part of the tree with the best complete-data log
likelihood after 4000 iterations. The tree provides a useful visualization of
the data set, capturing broad variations in color at the higher levels of the
tree, with lower branches varying in texture and shape. A larger version of
this tree is provided in the supplementary material.
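The factored Bernoulli likelihood with a logistic link used in this section can be evaluated stably in log space. The sketch below is an illustration rather than the code used for the experiments; the function names are hypothetical, and plain Python lists stand in for the 256-dimensional vectors.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(1 / (1 + exp(-t)))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def bernoulli_logistic_loglik(x, theta):
    """Sum over dimensions of x*log(sigmoid(theta)) + (1 - x)*log(sigmoid(-theta))."""
    return sum(log_sigmoid(t) if bit else log_sigmoid(-t) for bit, t in zip(x, theta))

value = bernoulli_logistic_loglik([1, 0, 1], [2.0, -1.0, 0.5])
```

The second term uses the identity that one minus the logistic function of a value equals the logistic function of its negation, so neither branch exponentiates a large positive number.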



^1 http://www.cs.utoronto.ca/~kriz/cifar.html

Fig 3: These figures show a subset of the tree learned from the 50,000 CIFAR-100
images. The top tree only shows nodes for which there were at least 250
images. The ten shown at each node are those with the highest probability 
under the node's distribution. The second row shows three expanded views of 
subtrees, with nodes that have at least 50 images. Detailed views of portions 
of these subtrees are shown in the third row. 



6. Hierarchical Modeling of Document Topics. We also used our
approach in a bag-of-words topic model, applying it to 1740 papers from
NIPS 1-12^2. As in latent Dirichlet allocation (LDA) [Blei et al., 2003], we
consider a topic to be a distribution over words and each document to be
described by a distribution over topics. In LDA, each document has a unique
topic distribution. In our model, however, each document lives at a node and
that node has a unique topic distribution. Thus multiple documents share
a distribution over topics if they inhabit the same node. Each node's topic
distribution is from a chained Dirichlet-multinomial as described in Section 3.
The topics each have symmetric Dirichlet priors over their word distributions.
This results in a different kind of topic model than that provided by the
nested Chinese restaurant process. In the nCRP, each node corresponds to a
topic and documents are infinitely-long paths down the tree. Each word is
drawn from a distribution over depths that is given by a GEM distribution.
In the nCRP, it is not the documents that have the hierarchy, but the topics.

^2 http://cs.nyu.edu/~roweis/data.html

Fig 4: A subtree of documents from NIPS 1-12, inferred using 20 topics.
Only nodes with at least 50 documents are shown. Each node shows three
aggregated statistics at that node: the five most common author names, the
five most common words and a histogram over the years of proceedings.

We did two kinds of analyses. The first is a visualization as with the image
data of the previous section, using all 1740 documents. The subtree in Fig. 4
shows the nodes that had at least fifty documents, along with the most
common authors and words at that node. The normalized histogram in each
box shows which of the twelve years are represented among the documents in
that node. An expanded version of this tree is provided in the supplementary
material. Secondly, we quantitatively assessed the predictive performance of
the model. We created ten random partitions of the NIPS corpus into 1200
training and 540 test documents. We then performed inference with different
numbers of topics (10, 20, ..., 100) and evaluated the predictive perplexity of
the held-out data using an empirical likelihood estimate taken from a mixture
of multinomials (pseudo-documents of infinite length, see, e.g., Wallach et al.
[2009]) with 100,000 components. As Fig. 5a shows, our model improves in
performance over standard LDA for smaller numbers of topics. We believe
this improvement is due to the constraints on possible topic distributions
that are imposed by the diffusion. For larger numbers of topics, however,
it seems that these constraints become a hindrance and the model may be
allocating predictive mass to regions where it is not warranted. In absolute
terms, more topics did not appear to improve predictive performance for
LDA or the tree-structured model. Both models performed best with fewer
than fifty topics and the best tree model outperformed the best LDA model
on all folds, as shown in Fig. 5b.

Fig 5: Results of predictive performance comparison between latent Dirichlet
allocation (LDA) and tree-structured stick breaking (TSSB). a) Mean improvement
in perplexity per word over Laplace-smoothed multinomial, as a
function of the number of topics (larger is better). The error bars show the
standard deviation of the improvement across the ten folds. b) Best predictive
perplexity per word for each fold (smaller is better). The numbers above the
LDA and TSSB bars show how many topics were used to achieve this.
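For readers unfamiliar with the metric, per-word perplexity is conventionally the exponentiated negative average log-probability per held-out word. The sketch below assumes that standard definition and takes the per-document log-probabilities as given (for example from the empirical likelihood estimate described above); the function name is hypothetical.

```python
import math

def perplexity_per_word(doc_log_probs, doc_lengths):
    """exp(-(sum of held-out document log-probabilities) / (total number of words))."""
    return math.exp(-sum(doc_log_probs) / sum(doc_lengths))

# A model that spreads mass uniformly over a 10-word vocabulary
# assigns log-probability 5 * log(1/10) to any 5-word document.
uniform_doc = 5 * math.log(0.1)
value = perplexity_per_word([uniform_doc, uniform_doc], [5, 5])
```

Under this definition a uniform model over a vocabulary of size V has perplexity V, so lower values indicate sharper predictions.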

The MCMC inference procedure we used to train our model was as
follows: first, we ran Gibbs sampling of a standard LDA topic model for
1000 iterations. We then burned in the tree inference for 500 iterations with
fixed word-topic associations. We then allowed the word-topic associations
to vary and burned in for an additional 500 iterations, before drawing 5000
samples from the full posterior. For the comparison, we burned in LDA for
1000 iterations and then drew 5000 samples from the posterior [Griffiths
and Steyvers, 2004]. For both models we thinned the samples by a factor of
50. The mixing of the topic model seems to be somewhat sensitive to the
initialization of the $\kappa$ parameter in the chained Dirichlet-multinomial and
we initialized this parameter to be the same as the number of topics.

7. Discussion. We have presented a model for a distribution over random
measures that also constructs a hierarchy, with the goal of constructing
a general-purpose prior on tree-structured data. Our approach is novel in
that it combines infinite exchangeability with a representation that allows
data to live at internal nodes on the tree, without a hazard rate process. We
have developed a practical inference approach based on Markov chain Monte
Carlo and demonstrated it on two real-world data sets in different domains.

The imposition of structure on the parameters of an infinite mixture model 
is an increasingly important topic. In this light, our notion of evolutionary 
diffusion down a tree sits within the larger class of models that construct 
dependencies between distributions on random measures [MacEachern, 1999, 
MacEachern et al., 2001, Teh et al., 2006]. 